library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
# devtools::install_github("ropensci/monkeylearn", ref = "dev")
library(monkeylearn)
library(here)
## here() starts at /Users/amanda/Desktop/Projects/monkeylearn
source(here("blogpost", "key.R"))

This is a quick story all about how I started contributing to the rOpenSci package monkeylearn.

Things started at work, when I was looking around for an easy way to classify groups of products using R. I made the very clever first move of Googling “easy way to classify groups of products using R” and thanks to the magic of what I suppose used to be PageRank I landed upon a GitHub README for a package called monkeylearn.

A quick devtools::install_github("ropensci/monkeylearn") and the creation of an API key later, this seemed like the package to fit my needs. I loved that it sported only two functions, monkeylearn_classify() and monkeylearn_extract(), which did exactly what they said on the tin: they accept a vector of texts and return a dataframe of classifications or keyword extractions.

For a bit of background, the monkeylearn package hooks into the MonkeyLearn API, which uses natural language processing techniques to take a text input and either extract keywords from it or hand back a vector of classifications along with other data such as the probability attached to each one. There is a set of built-in “modules”, but users can also create their own “custom” modules 1.

I discovered the only stumbling block for my use case was that texts sent to the MonkeyLearn API could not be batched. This wasn’t because the monkeylearn_classify() and monkeylearn_extract() functions themselves didn’t accept multiple inputs. Instead, it was because they didn’t explicitly relate inputs to outputs. This became a problem because inputs and outputs are not 1:1; if I send a vector of three texts for classification, my output dataframe might be 10 rows long, and there was no way to know whether the first two or the first four rows, for example, belonged to my first input.

texts <- c(
    "In a hole in the ground there lived a hobbit.",
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
    "When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton.")

(texts_out <- monkeylearn_classify(texts))
## Warning in strptime(x, fmt, tz = "GMT"): unknown timezone 'zone/tz/2018c.
## 1.0/zoneinfo/America/Chicago'
## # A tibble: 10 x 4
##    category_id probability              label
##          <int>       <dbl>              <chr>
##  1    18313280       0.071              Music
##  2    18313502       0.054        Music DVD's
##  3    18313524       0.553 See All Music DVDs
##  4    18314767       0.062              Books
##  5    18314954       0.047 Mystery & Suspense
##  6    18314957       0.102  Police Procedural
##  7    18313210       0.082  Party & Occasions
##  8    18313231       0.176    Party Supplies 
##  9    18313235       0.134  Party Decorations
## 10    18313236       0.406        Decorations
## # ... with 1 more variables: text_md5 <chr>

This works great if you don’t care about classifying your inputs independently of one another. (Say, you’re interested in classifying a whole chapter of a book.) In my case, though, my inputs were independent of one another and had to be classified independently.

Initial Workaround

My initial approach to this was to simply treat each text as a separate call. I wrapped monkeylearn_classify() in a function that would send a vector of texts and return a dataframe relating the input in one column to the output in the others. It was essentially the below, with some more error handling and bells and whistles particular to the problem.

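initial_workaround <- function(df, col, key) {
  
  # Capture the bare column name so it can be unquoted inside select()
  quo_col <- enquo(col)
  
  # Placeholder column; each row's entry is overwritten in the loop below
  out <- df %>% 
    mutate(
      tags = NA_character_
    )
  
  # One API call per row: slow, but keeps each input tied to its own output
  for (i in seq_len(nrow(df))) {
    this_text <- df %>% select(!!quo_col) %>% slice(i) %>% as_vector()
    # Wrap the returned dataframe in a list so it fits in a single cell
    this_out <- monkeylearn_classify(this_text, key = key) %>% list()
    out[i, ]$tags <- this_out
  }

  return(out)
}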

Since it takes a dataframe rather than a vector, let’s turn our sample into a tibble.

texts_df <- tibble(texts)

And run the workaround:

(initial_out <- initial_workaround(texts_df, texts, key = ml_key))
## # A tibble: 3 x 2
##                                                                         texts
##                                                                         <chr>
## 1                               In a hole in the ground there lived a hobbit.
## 2 It is a truth universally acknowledged, that a single man in possession of 
## 3 When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebr
## # ... with 1 more variables: tags <list>

We see that this retains the 1:1 relationship between input and output, but still allows the output list-col to be unnested.

initial_out %>% unnest()
## # A tibble: 10 x 5
##                                                                          texts
##                                                                          <chr>
##  1                               In a hole in the ground there lived a hobbit.
##  2                               In a hole in the ground there lived a hobbit.
##  3                               In a hole in the ground there lived a hobbit.
##  4 It is a truth universally acknowledged, that a single man in possession of 
##  5 It is a truth universally acknowledged, that a single man in possession of 
##  6 It is a truth universally acknowledged, that a single man in possession of 
##  7 When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebr
##  8 When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebr
##  9 When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebr
## 10 When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebr
## # ... with 4 more variables: category_id <int>, probability <dbl>,
## #   label <chr>, text_md5 <chr>

But, the catch: this approach was quite slow. The bottleneck wasn’t the for loop; it was the round trip to the MonkeyLearn API for each individual text. For just these three meager texts, initial_workaround() takes a painfully long time.

system.time(initial_workaround(texts_df, texts, key = ml_key))
##    user  system elapsed 
##   0.035   0.001   6.572

Even classifying relatively “small” data – around 70,000 texts – was going to take a looong time. I updated the function to write each row out to an RDS file after it was classified inside the loop (along the lines of write_rds(out[i, ], glue::glue("some_directory/{i}.rds"))) so that I wouldn’t have to rely on the function successfully finishing execution in one run. Still, I didn’t like my options.
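A minimal sketch of that checkpointing idea, under some loud assumptions: fake_classify() is a hypothetical stand-in for the API call, checkpoint_classify() is a made-up name, the files go to a temp directory, and base saveRDS() is used in place of write_rds(). It is not the function that actually ran, just the shape of it.

```r
# Hypothetical stand-in for the API call: one small dataframe per text.
fake_classify <- function(text) {
  data.frame(label = "Books", probability = 0.5, stringsAsFactors = FALSE)
}

# Hypothetical helper: classify each text and write its result to disk at once.
checkpoint_classify <- function(texts, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_along(texts)) {
    path <- file.path(out_dir, paste0(i, ".rds"))
    # Skip texts already classified in an earlier, interrupted run.
    if (file.exists(path)) next
    saveRDS(fake_classify(texts[i]), path)
  }
  # Read every checkpoint back in, in input order.
  lapply(seq_along(texts), function(i) {
    readRDS(file.path(out_dir, paste0(i, ".rds")))
  })
}

out_dir <- file.path(tempdir(), "ml_checkpoints")
results <- checkpoint_classify(c("text one", "text two", "text three"), out_dir)
length(results)
```

Because each result lands on disk as soon as it comes back, a crash partway through only costs the texts that had not been reached yet.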

This job was intended to be run every night, and with an uncertain amount of text data coming in every day, I didn’t want it to run for more than 24 hours one day and either a) prevent the next night’s job from running or b) necessitate spinning up a second server to handle the next night’s data.

Diving In

Now that I’m like

I’m just about at the point where I have to start making myself useful.

My specific question was about the mechanics of how the batching is done. I’d seen in the package docs and on the MonkeyLearn FAQ that batching up to 200 texts was possible. Batching doesn’t save you on requests (sending 200 texts in a batch means you now have 200 fewer queries), but it does save you bigtime on speed.
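To make the batching idea concrete, here is a rough sketch of splitting a vector into batches of at most 200 texts. make_batches() is a hypothetical helper for illustration, not something taken from the package.

```r
# Hypothetical helper: split a character vector into batches of at most
# `batch_size` texts, preserving the original order.
make_batches <- function(texts, batch_size = 200) {
  batch_id <- ceiling(seq_along(texts) / batch_size)
  unname(split(texts, batch_id))
}

many_texts <- paste("text", 1:450)
batches <- make_batches(many_texts)

# Number of texts in each batch
lengths(batches)
```

With 450 texts this gives three batches, so three round trips to the API instead of 450.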

Was MonkeyLearn returning JSON that severed the ties between input and output? I sort of doubted it. You’d think that an API that was sent a JSON array of inputs would send back an array to match. My hunch was that either the package was concatenating the input (which would save the user on API queries) or rbinding the output.

I forked the package and set about rummaging through the source code. Blissfully, everything is nicely commented and the code is quite readable.

Inside monkeylearn_classify() I spot a call to monkeylearn_parse(), which I suspect is a utility function. I find it in utils.R.

The lines in monkeylearn_parse() that matter for our purposes are:

temp <- jsonlite::fromJSON(text)
if(length(temp$result[[1]]) != 0){
  results <- do.call("rbind", temp$result)
}

So this is where the rowbinding happens!

I set about copying monkeylearn_parse() and doing a bit of surgery on it to create monkeylearn_parse_each() which skips the rbinding and retains the list structure of each output. The rest of the function was so airtight I didn’t have to change much more to get it working.
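The difference between the two approaches can be sketched with mock data. The values below are made up, mock_result only imitates the general shape of a parsed response (one dataframe per input), and none of this is the package's actual code.

```r
# Mock parsed response: one dataframe of classifications per input text.
mock_result <- list(
  data.frame(label = c("Music", "DVDs"), probability = c(0.07, 0.55)),
  data.frame(label = "Books", probability = 0.06),
  data.frame(label = c("Party", "Decorations", "Supplies"),
             probability = c(0.08, 0.41, 0.18))
)
inputs <- c("text a", "text b", "text c")

# Row-binding flattens everything into one dataframe, with no record of
# which input produced which rows.
flat <- do.call("rbind", mock_result)

# Keeping the list leaves one element per input, so each result can be
# keyed by the text that produced it.
kept <- setNames(mock_result, inputs)

nrow(flat)
vapply(kept, nrow, integer(1))
```

The flat version has six anonymous rows, while the kept version still knows that "text b" produced exactly one of them.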

At this point, I’d plumbed through the guts of the package and thought that such a function might be useful to some other people out there.

I started writing a new function with an eye toward making a pull request.

Since I’d found it useful to be able to pass in a dataframe, I figured I’d retain that aspect of the function. I wanted users to still be able to pass in a bare column name but the package seemed to be light on the tidyverse so I un-tidyeval’d it (using deparse(substitute(col)) instead of a quosure) and gave it the imaginative name…monkeylearn_classify_df.
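For anyone who hasn’t seen the base R trick, this is roughly how deparse(substitute(col)) lets a function accept a bare column name. pull_col() is a hypothetical helper written for this illustration, not a function from the package.

```r
# Hypothetical helper: accept a bare column name without tidyeval by capturing
# the expression the caller typed and turning it into a string.
pull_col <- function(df, col) {
  col_name <- deparse(substitute(col))
  if (!col_name %in% names(df)) {
    stop("Column `", col_name, "` not found in `df`.", call. = FALSE)
  }
  df[[col_name]]
}

sample_df <- data.frame(texts = c("a hobbit", "a wife"), stringsAsFactors = FALSE)
pull_col(sample_df, texts)
```

The tradeoff is that this only works when the caller types the column name directly; passing the name along through another function takes more care than it does with quosures.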

Along the way I also caught a couple minor bugs (things like the remnants of a for loop remaining in what had been revamped into a while loop).

After a few more checks I wrote up a quick pull request and checked the package contributors to see if I knew anyone. Far and away the main contributor was Maëlle Salmon! I’d heard of her through the magic of #rstats Twitter and the R-Ladies Global Slack. A minute or two after submitting the PR I headed over to Slack to give her a heads up that a PR would be heading her way.

In what I would come to know as her usual cheerful, perpetually-on-top-of-it form, Maëlle had already seen it and liked the idea.

Continuing Work

To make a short story shorter, I began working on the package, chatting with Maëlle over rOpenSci Slack about tradeoffs like which package dependencies we were okay with taking on, whether to go the tidyeval or base route, what the best naming conventions for the new functions should be, etc.

On the naming front, we decided to slowly phase out monkeylearn_classify and monkeylearn_extract as the newer functions could cover all of the functionality of the older guys. I don’t know much about cache invalidation, but the naming problem was hard as usual. We settled on naming their counterparts monkey_classify (which replaced the original monkeylearn_classify_df) and monkey_extract.

git flow

Early on we started talking git best practices. I floated a structure that we typically follow at my company, where each ticket (or in this case, issue) becomes its own branch off of the development branch. For instance, issue #33 becomes branch T33 (T for ticket). This approach, I am told, stems from the “git flow” philosophy which is one of many ways to structure a git workflow that mostly doesn’t end in tears.

The idea here is to make pull requests as bite-sized as possible. You minimize merge conflicts by assigning each ticket/issue to only one person. (An added benefit for me, at least, is that it keeps me from wandering off into other parts of the code that I notice could be improved without first documenting them and making a separate issue and branch.) This method also leaves a nice paper trail because the branch directly references the issue in its title, so you don’t have to explicitly name the issue in the commit or rely on GitHub’s (albeit awesome) keyword issue-closing system.

While this branch naming convention isn’t particular to git flow (to my knowledge), it did spark a short conversation on Twitter about naming conventions that I think was worth having.

Finally, since the system is so tied to issues themselves, it encourages very frequent communication between collaborators, provided, of course, that everyone’s more or less keeping up with new issues. Since the issue must necessarily be made before the branch and the accompanying changes to the code, the other contributors have a chance to weigh in on the issue or the approach suggested in its comments before any code is written.

It has worked out swimmingly for us thus far, minimizing disaster and making communication the path of least resistance. Despite the time difference and 👶(!), communication hasn’t been an issue for us at all. Maëlle’s been a fantastic mentor through and through.

Package Improvements

As I mentioned, the package was so good to begin with that it was difficult to find ways to improve it. Most of the work I did went into improving the new monkey_ functions.

They got more informative messages about which batches were being processed and which texts those batches corresponded to. Rather than discarding inputs such as empty strings that could not be sent to the API, they now return an output of the same dimensions as the input, which can be unnested either with an unnest flag or after the fact.

text_w_empties <- c(
    "In a hole in the ground there lived a hobbit.",
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
    "",
    "When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton.",
    " ")

(empties_out <- monkey_classify(text_w_empties, texts_per_req = 2, unnest = TRUE))
## The following indices were empty strings and could not be sent to the API: 3
##         They will still be included in the output.
## Processing batch 1 of 2 batches: texts 1 to 2
## Processing batch 2 of 2 batches: texts 2 to 4
## # A tibble: 12 x 4
##                                                                            req
##                                                                          <chr>
##  1                               In a hole in the ground there lived a hobbit.
##  2                               In a hole in the ground there lived a hobbit.
##  3                               In a hole in the ground there lived a hobbit.
##  4 It is a truth universally acknowledged, that a single man in possession of 
##  5 It is a truth universally acknowledged, that a single man in possession of 
##  6 It is a truth universally acknowledged, that a single man in possession of 
##  7                                                                            
##  8 When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebr
##  9 When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebr
## 10 When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebr
## 11 When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebr
## 12                                                                            
## # ... with 3 more variables: category_id <int>, probability <dbl>,
## #   label <chr>
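The idea of keeping unsendable inputs in the output can be sketched like this. It is a simplified illustration, not the package’s implementation: fake_classify() and classify_keep_empties() are hypothetical, and treating whitespace-only strings as empty is an assumption made for the sketch.

```r
# Hypothetical stand-in for the API: returns one label per text it is sent.
fake_classify <- function(texts) paste0("label_for_", seq_along(texts))

# Hypothetical helper: classify only the sendable texts, but return one row
# per input so the output lines up with what was passed in.
classify_keep_empties <- function(texts) {
  # Assumption: empty and whitespace-only strings both count as unsendable.
  is_empty <- !nzchar(trimws(texts))
  out <- rep(NA_character_, length(texts))
  if (any(!is_empty)) {
    out[!is_empty] <- fake_classify(texts[!is_empty])
  }
  data.frame(req = texts, label = out, stringsAsFactors = FALSE)
}

res <- classify_keep_empties(c("hobbit", "", "wife", " "))
nrow(res)
is.na(res$label)
```

The empty inputs come back as NA rows rather than disappearing, so the result can be bound back onto the original data without any index bookkeeping.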

Concluding words

Something I’ve been thinking about while working on the twin functions monkey_extract and monkey_classify is what the best practice is for developing very similar functions in tandem. The functions are different enough to have different default values (monkey_extract has a default extractor_id while monkey_classify of course has a default classifier_id) but are very similar in other regards.

Is it better to port all changes made to one over to the other with every update? Or is it instead better to work on one function at a time and then, at certain checkpoints, carry those changes over to the other function in one big copy-paste job?

Since there are only two functions to worry about here, creating a function factory to handle them seemed like overkill.
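For the curious, a function factory here might look something like the sketch below. Everything in it is hypothetical: send_to_api() is a placeholder rather than a real package function, and the IDs are invented. It only shows the shape of the idea, with the default ID as the one thing that differs between the two generated functions.

```r
# Hypothetical placeholder for the shared request logic.
send_to_api <- function(texts, endpoint, id) {
  paste(endpoint, id, texts, sep = ":")
}

# Factory: builds near-identical functions that differ only in their endpoint
# and default ID.
make_monkey_fun <- function(endpoint, default_id) {
  # Evaluate the arguments now so each closure keeps its own values.
  force(endpoint)
  force(default_id)
  function(texts, id = default_id) {
    send_to_api(texts, endpoint = endpoint, id = id)
  }
}

sketch_classify <- make_monkey_fun("classifiers", "cl_example")
sketch_extract  <- make_monkey_fun("extractors", "ex_example")

sketch_classify("a hobbit")
sketch_extract("a hobbit", id = "ex_other")
```

Any change to the shared logic would then only need to be made once, at the cost of a layer of indirection that is harder to document and debug.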


  1. Custom, to a point. As of this writing, the two types of classifier models you can create use either Naive Bayes or Support Vector Machines, though you can specify other parameters such as use_stemmer and strip_stopwords.